feat: write nested struct layouts #4942

Merged: a10y merged 21 commits into develop from aduffy/nest1 on Oct 28, 2025

a10y (Contributor) commented Oct 14, 2025

Part of #4889

To better support writing nested data, here we update our built-in StructLayout in a backwards-compatible way.

  • StructStrategy will recursively shred struct fields into their own StructLayout
  • StructLayout now supports nullable structs. It does this by writing an additional child layout that holds the validity buffer for nullable arrays
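
A minimal sketch of the recursive shredding described above, in plain Python rather than the actual Rust implementation. Everything here is a toy stand-in: `Primitive`, `Struct`, `FlatLayout`, and `shred` are hypothetical names, and the `StructLayout` class only borrows the name from this PR. It assumes each nullable struct gets one extra boolean validity layout and each non-struct field is written as a single flat layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Primitive:
    name: str  # e.g. "i64" or "utf8" (illustrative type names)


@dataclass
class Struct:
    fields: Dict[str, Union["Primitive", "Struct"]]
    nullable: bool = False


@dataclass
class FlatLayout:
    dtype: str


@dataclass
class StructLayout:
    children: Dict[str, Union["FlatLayout", "StructLayout"]]
    # Only present when the struct itself is nullable (assumption of this sketch).
    validity: Optional[FlatLayout] = None


def shred(dtype):
    """Recursively turn a nested type into a tree of layouts."""
    if isinstance(dtype, Primitive):
        return FlatLayout(dtype.name)
    validity = FlatLayout("bool") if dtype.nullable else None
    children = {name: shred(child) for name, child in dtype.fields.items()}
    return StructLayout(children=children, validity=validity)


# A made-up, heavily simplified event schema.
event = Struct({
    "action": Primitive("utf8"),
    "pull_request": Struct(
        {
            "number": Primitive("i64"),
            "user": Struct({"login": Primitive("utf8")}),
        },
        nullable=True,
    ),
})

layout = shred(event)
```

In this sketch `layout.children["pull_request"]` is itself a struct layout with its own validity child, which is what would let a reader address `pull_request.user.login` without decoding the sibling fields.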

I use the RealNest dataset to evaluate, which contains a copy of ~200k GitHub pull request webhook events. Nested struct layout reduces file size by about 10% compared with the previous strategy, and also makes pushdown into the nested columns possible.

Some open questions

  • The validity child requires some extra handling. How the validity is applied seems to depend heavily on the expression being pushed down. For example, if I'm doing a simple projection of a child field, then adding the validity to the result is a simple masking operation. If I'm pushing down an UNNEST or something else that increases the result size, it is hard to map the validity buffer onto the projection_eval result
  • I collect all validity chunks into a single buffer at write time. The idea is that it's better to access the struct validity as a single unit, since it is much smaller than the data. Assuming an 8MB target segment size and one validity bit per row, this lets us comfortably fit ~64M rows into a single segment. An alternative is to bring back roaring, or enable some other boolean compressors.
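
To make the two points above concrete, here is a small Python sketch (not the PR's code). `project_child` and `rows_per_validity_segment` are hypothetical helpers. The sketch assumes validity is one boolean per row, that a projected row is null when either the struct or the child is null, and that the validity buffer is bit-packed and uncompressed with 8MB read as 8 MiB.

```python
def project_child(struct_validity, child_values, child_validity=None):
    """Project one child field out of a nullable struct by masking.

    The result has exactly one entry per input row, so the struct's
    validity lines up with the child values position by position.
    """
    if child_validity is None:
        child_validity = [True] * len(child_values)
    return [
        value if (struct_ok and child_ok) else None
        for value, struct_ok, child_ok in zip(
            child_values, struct_validity, child_validity
        )
    ]


def rows_per_validity_segment(segment_bytes):
    """Rows covered by one segment at one validity bit per row."""
    return segment_bytes * 8


# Row 1 has a null struct, so its child value is masked out.
numbers = project_child([True, False, True], [101, 0, 103])

rows = rows_per_validity_segment(8 * 1024 * 1024)
```

Here `numbers` is `[101, None, 103]` and `rows` is 67,108,864, which is where the ~64M figure comes from. The masking only works because the output has the same length as the validity buffer; an expression that changes the row count, such as UNNEST, breaks that one-to-one alignment.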


Labels

changelog/feature A new feature
